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Nice Mathematical Properties 

• "Exponential" accuracy 

- Double the effective resolution compared to FD 

- Commonly used as baseline for comparison 

• Fast inversion of elliptic operators 

- Diagonal or nearly diagonal matrices 

- Enables efficient implicit time stepping 

• Natural boundary conditions 
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DYNAMO Goals 

• Existing runs are for n ~ 500-1000 

- Run on variety of dusters 

- Performance constrained by ID decomposition 

• 0(500 nodes) 

• Wasting lots of cores to get necessary memory 

• Limited to ~ X TF 

• Would like to achieve ~10x resolution 

- Needs at least petascale platform 

- Need consistent 2D (or even 3D) domain decomp 

• Stretch - implicit treatment of coriolis 



Pseudospectral Methods 

Complementary mathematical representations are used in different phases of 

calculations. 


Spatial derivatives computed 

Nonlinear products evaluated 

in frequency (spectral) 

in spatial domain 

domain: 

V 2 A(k ljk ) = |k k | 2 A(k iJk ) 

(“configuration space"): 





Each type of calculation is extremely simple and accurate in the appropriate 
domain. 


3 


4 





10/4/11 


10/4/11 


Spatial and spectral domains are connected via 
transforms: 


f , ,, 7] transform T~ 

Nk); <|>(x) 1 

Simple and efficient transforms are only possible in 
relatively small number of simple geometries. 

• Periodic box: FFT 0(n 3 log n) 

• Separable coordinates: 0(n 4 ) 

• Spherical shells are intermediate 

• General case is 0(n 6 ) - uncompetitive 



PS Methods on Spherical Shells 


• Spectral expansion based upon spherical harmonics 

- Yje,<j>) (m = 0, m max , I = 0 \ m jm 

• Radial expansion based upon Chebyshev polynomials 

- T n (r) (n = 0, .... n max ) 

• Poisson operators block diagonal (nontrivial coupling in radial) 
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Interesting Software Aspects 


• Elegant program structure 

- Sequence of transforms coupled by memory transposes 

- Software infrastructure plays major role 

• Non-recta ngular domains 

• Non-trSvial domain decomposition 

• Non -obvious data layout 

• Unique performance aspects 

- Different scaling properties: transpose vs. halo fill 

- Nothing to optimize! 


• FP workload largely in optimized libraries (FFT, DGEMM, ...) 

* All-to-ali is part of HPC benchmarks 


^jjj^ 
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Sketch of PS Implementation 



Domain Decomposition Strategy 
For a Given Transform 

• Co-locate transform axis 

- Perform operation "in processor" 

- Leverage serial implementation 

- Often available in optimized library (FFT, DGEMM) 

• Distribute remaining coordinates 

- No computational dependencies 

- Effective 2D decomposition 

• Balance/Optlmize 

- Memory/performance optimization by grouping along 
"independent" axis. 



Constraints on Decomposition 


Unfortunately ... 

• General anatomy of a transform 


* No decomposition scheme works for all cases 

- Acts on 1 axis of domain Q(i,j,k) -> Q'(m,j,k) 


- Usually need separate layout for each transform! 

- Independent of at least one axis 


- Sometimes can do 2 transforms on one layout if using a ID 

- Possibly parameterized by another axis 


decomposition 

* Example: Legendre Transform 


• Load balance is nontrivial for non-rectangular 

— Acts on meridional coordinate/degree 


domains. 

- Independent of radial coordinate 



- Parameterized by wavenumber coordinate 



i 
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Example: Legendre Transform 

* Transforms degree i to/from 8 (DGEMV) 

“ Independent of radial coordinate r 

- Parameterized by azimuthal wavenumber m 

* Decomposition constraints: 

- Keep /and 0 in processor 

- Distribute wavenumbers across processors. 

- Group r on processor to improve cache reuse (DGEMV -> 
DGEMM} 

• Split blocks of r across processors to balance scalability against 

serial performance. 




Implicit Update 

Couples radial coordinate (DGETRS) 

- Mostly independent of azimuthal wavenumber 

- Dependent on degree X, 

Strategy 

- Group wavenumber for BLAS2 -> BLASS 
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Abstractions 

Axis: Label(s) and coordinate indices 


Triangular domain 
treated as 1 D 
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Domain: Product of axes 

- Note, collection of fields can be an “axis” 

- DistributedDomain: subclass with “map” 

Field: Domain reference with array of values 



Going to Petascale 



Abstractions (cont'd) 


Transformer: 

- Produces: Decomposition constraints, Cost function 

- Field* 4 Field*, 

Balancer: 

- Ingest: Transformer, Auxiliary Coords, Communicator 

- Produces: fn/Out Domains, Instantiated Transformer 

Transposer: 

- Initialization: {Domain A , Domain B } 

- Field* -4 Field*, 



Specific Challenges 

• Accuracy 

• Transposes 

• Initialization 
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Representative Numbers 


Problem Size 


n - grid pts in 1 
dimension 


N - total grid points 
N. - number of iterations 


N l - Legendre per step 30 


N. - # core/node 


N n - # nodes 


11,000 


50,000 


a - network latency (ps) 
B - bisection bw (TB/s) 


Legendre 2 nd Derivative 
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Back of the Envelope 

Compute time dominated by Legendre DGEMM: 

_N d Njl_ 

Rfp 

Communication dominated by global transposes 
- Each transpose moves V* of data to other half of machine 


- Each process sends packet to p VD other processes (D=l,2,3) 

4n 3 

' B -' 1 


a'V' u + 



Desirable Characteristics 


Domination by computations: 

- 8-4000 

n > 

4 Rp 

B 

Bandwidth dominates latency: 

n<\ 


- 250 - 125000 


l 4 ) 

Complete in T s/m seconds 

n < 

(T,A „P 

- 4000 for 1 week turnaround 

\ , 

Efficient level 3 BLAS 
- -10000 

n » 

■p'“ 




Fast Initialization 

Distribution 1 

Distribution 2 



■ . ■ 

1 . Global (sample) sort is used to 



order metadata. 



• Complexity -0((N/p) log 2 p) 



• Use same pivots for both 



distributions 


yjp 1 I 

• Data tagged with original PE 


-- - B 

2 Each processor compares sorted 

■•■■•Mil w*u 


metadata to determine overlap. 


p -; - -‘B 

3 Overlap data is then “unsorted” 

nry - ijytJi 
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Petascale initialization 


^ Sorted metadata 

|L :-;4 1 

should require -10 minutes. 
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PGAS (CAF) 

• Efficient implementations of realistic global 
transposes are intricate and tedious in MP1. 

• PS at petascale requires exploration of a variety of 
strategies for spreading local and remote 
communications. 

• PGAS allows far simpler implementation and thus 
rapid exploration of variants. 




Conclusions 

Proper software abstractions should enable rapid- 
exploration of platform-specific optimizations/ 
tradeoffs. 

Pseudo-spectral methods are marginally viable for at 

least some classes of petascale problems. 

- A GPU based machine with good bisection would be best. 
Scalability at exascale is possible, but the necessary 
resolution will make algorithm prohibitively 
expensive. 
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